This report documents text mining on the 1847 editions of the newspaper Lyna. The textual data used in this report is made accessible through the Library Open Access Repository (LOAR) under the Royal Danish Library. For more information about LOAR and Lyna, see: https://loar.kb.dk/handle/1902/7736

The Lyna 1847 dataset comes in the form of a CSV file (Comma-Separated Values), a format for storing tabular data. In this dataset the structure is as follows:

Each row has the following columns:

RecordID The unique id given to each article segment recognised in the digitisation process

sort_year_asc Publication date of the article segment

editionId Unique name given to the newspaper, combined once again with the date

newspaper_page The page in the newspaper that the article segment derives from

pwa Measurement of OCR quality

lvx_facet Internal search faceting

fulltext_org The recognised text from the article segment

The dataset is processed in the software environment R, which offers various methods for statistical analysis and graphic representation of the results. In R, one works with packages, each adding numerous functionalities to the core functions of R. In this example, the relevant packages are:

library(tidyverse)
library(tidytext)
library(lubridate)
library(ggplot2)
library(ggwordcloud)

Documentation for each package:
https://www.tidyverse.org/packages/
https://cran.r-project.org/web/packages/tidytext/vignettes/tidytext.html
https://lubridate.tidyverse.org/
https://ggplot2.tidyverse.org/
https://cran.r-project.org/web/packages/ggwordcloud/vignettes/ggwordcloud.html

Additional information about R: https://www.r-project.org/

Data import


The dataset is loaded into R via the read_csv function, which is given the URL pointing to the CSV file containing the Lyna data from 1847:

lyna1847 <- read_csv("https://loar.kb.dk/bitstream/1902/7787/1/lyna_1847.csv")
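To verify that the import worked and to see the columns described above, a quick sketch (assuming the download succeeded) is to inspect the dataframe with glimpse() from the tidyverse:

```r
# Inspect the structure of the imported dataset:
# column names, column types and the first few values.
glimpse(lyna1847)
```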

Cleaning the data

Extracting month out of date

In order to see which words were significant to the individual months of 1847, we need to extract the month from the date column. We create a new column called “month” using the mutate() function, and in this column we place the month extracted from the date column using the month() function:

lyna1847 %>% 
  mutate(month = month(sort_year_asc)) -> lyna1847


The data processing will be based on the Tidy Data Principle as it is implemented in the tidytext package. The notion is to take the text and break it into individual words, so that there is just one word per row in the dataset.

However, this poses a problem in relation to proper nouns such as “Frederik Emil August”. Under the tidytext principle, this proper noun will be broken up, with “frederik”, “emil” and “august” placed in separate rows. This results in a loss of meaning, because these names on their own do not make sense. “Frederik Emil August” is a semantic unit, which we destroy when converting to the tidytext format. In some text mining tasks it can therefore make sense to cleanse the text data so that “Frederik Emil August” becomes “frederik_emil_august” and thus does not get destroyed in the tidying process. It is of course impossible to anticipate all semantic units in a text, but in some cases, when examining persons and specific entities, it makes sense to cleanse before converting to the tidy text format. In this case, however, we will employ a distant reading perspective using the term frequency - inverse document frequency (tf-idf) method, so we will not do any cleansing.

Enough talk about the tidytext format - let’s see it in action!
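As a sketch of the cleansing step described above (not used in the rest of this analysis), a multiword name can be joined with underscores before tokenisation using str_replace_all from stringr; the sentence below is an invented example:

```r
library(stringr)

# Join a known multiword name with underscores, so that the
# default tokeniser keeps it together (underscores do not
# break words under Unicode word segmentation).
text <- "Kong Frederik Emil August besoegte byen"
str_replace_all(text, "Frederik Emil August", "Frederik_Emil_August")
```

For a real analysis one would apply this to the fulltext_org column with mutate() before calling unnest_tokens.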

Next, we transform the data into the aforementioned tidytext format, which places each word in a row of its own. This is achieved with the unnest_tokens function:

lyna1847 %>% 
  unnest_tokens(word, fulltext_org) 


Analysis

Now we will find the words that appear most commonly per month in the Lyna newspapers from 1847.

lyna1847 %>% 
  unnest_tokens(word, fulltext_org) %>% 
  count(month, word, sort = TRUE)

Not surprisingly, common function words top the list. This is not particularly interesting for our enquiry, as we want to see which words are specific to the individual months, and the function words appear in all months. The first step is finding a measurement that allows us to compare the frequency of words across the months. We can do this by calculating the word’s, or the term’s, frequency:

\[\text{frequency}=\frac{n_{\text{term}}}{N_{\text{month}}}\]

Before we can take this step, we need R to count how many words there are in each month. This is done by using the function group_by followed by summarise:

lyna1847 %>% 
  unnest_tokens(word, fulltext_org) %>% 
  count(month, word, sort = TRUE) %>% 
  group_by(month) %>% 
  summarise(total = sum(n)) -> total_words

Then we add the total number of words to our dataframe, which we do with left_join:

lyna1847 %>% 
  unnest_tokens(word, fulltext_org) %>% 
  count(month, word, sort = TRUE) %>% 
  
  left_join(total_words, by = "month") -> lyna1847_counts

Just typing the name of a dataframe in a code chunk and running it will print the dataframe:

lyna1847_counts

Now we have the numbers we need to calculate the frequency of the words. Below we calculate the frequency of the word “und” in May (month 5):

\[\text{frequency for “und” in May}=\frac{473}{42249}=0.01119553\]

By calculating the frequency of the terms, we can compare them across the months. However, it is not terribly interesting to compare the word “und” between months. Therefore, we need a way to “punish” words that occur frequently in all months. To achieve this, we use the inverse document frequency (idf):

\[idf(\text{term}) = \ln{\left(\frac{n_{\text{months}}}{n_{\text{months containing term}}}\right)}\] Here \(n_{\text{months}}\) is the total number of documents (months, in our example) and \(n_{\text{months containing term}}\) is the number of months in which the term occurs.

Let’s calculate the idf-value for “und”:

\[idf(\text{term}) = \ln{\left(\frac{12}{12}\right)}=0\]

Thus we punish words that occur with great frequency in all or many months. Words that occur in every month cannot really tell us much about a particular month. Those words will have an idf of 0, resulting in a tf_idf value that is also 0, because tf_idf is defined by multiplying \(tf\) with \(idf\):

\[tf\_idf=tf \times idf\] \(tf\_idf\) for the word “und”:

\[tf\_idf=0.01119553 \times 0= 0\]
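The arithmetic above can be checked directly in R; the numbers 473 and 42249 are the count and monthly total from the worked example:

```r
# tf: occurrences of "und" in May divided by total words in May
tf <- 473 / 42249

# idf: natural log of (number of months / months containing "und")
idf <- log(12 / 12)

# tf-idf is the product - 0 here, since "und" appears in every month
tf * idf
```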

Luckily, R can easily calculate tf, idf and tf_idf for all the words using the bind_tf_idf function:

lyna1847_counts %>% 
  bind_tf_idf(word, month, n) -> lyna1847_tfidf

lyna1847_tfidf

Nonetheless, we still do not see any interesting words, because the dataframe is not ordered by tf_idf. Instead, we ask R to list the words in descending order - highest to lowest tf_idf:

lyna1847_tfidf %>% 
  arrange(desc(tf_idf))
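Before plotting, it can also help to tabulate the top words month by month; a sketch using group_by combined with slice_max (the newer dplyr equivalent of top_n):

```r
# Show the 10 words with the highest tf_idf within each month
lyna1847_tfidf %>% 
  group_by(month) %>% 
  slice_max(tf_idf, n = 10) %>% 
  ungroup()
```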

It is hard to get a proper overview - let’s make a wordcloud with the 10 highest \(tf\_idf\) values within each month!

Visualisation

Many people who have dipped their toes in the text mining/data mining sea will have seen wordclouds showing the most used words in a text. In this visualisation we are going to create a wordcloud for each month showing the words with the highest tf_idf from that month. By doing so we get a nice overview of which words are specific and important to each month. Remember that words which appear a lot across the months will not show here.

lyna1847_tfidf %>%
  arrange(desc(tf_idf)) %>%
  mutate(word = factor(word, levels = rev(unique(word)))) %>% 
  group_by(month) %>% 
  top_n(10) %>% 
  ungroup %>%
  ggplot(aes(label = word, size = tf_idf, color = tf_idf)) +
  geom_text_wordcloud_area() +
  scale_size_area(max_size = 5.5) +
  theme_minimal() +
  facet_wrap(~month, ncol = 3, scales = "free") +
  scale_color_gradient(low = "darkred", high = "red") +
  labs(
      title = "Lyna 1847: most important words per month",
      subtitle = "Importance determined by term frequency (tf) - inverse document frequency (idf)")

Because the space for visualisation is constricted in this .Rmd format, we save the result as a PDF with a larger canvas. Run the code below and look in the Files pane to the right. You should find a file called “lyna1847_wordclouds.pdf”, which is more readable.

ggsave("lyna1847_wordclouds.pdf", width = 20, height = 16, units = "cm")

Examining results - moving between distant and close reading

So let’s say we want to examine what is going on with “frciheit” in December. We recognise that this is probably an OCR misreading of “freiheit”. In order to see this word in its context, we turn to our initial dataset, lyna1847. Here we use the filter function combined with the str_detect function to get only the rows where “frciheit” appears in the fulltext_org column:

lyna1847 %>% 
  filter(str_detect(fulltext_org, "frciheit"))
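The console truncates long text columns, so to read a whole article segment it can help to pull the column out as a plain character vector; a sketch:

```r
# Extract the full text of the matching article segments
# as a character vector, so nothing is truncated on screen.
lyna1847 %>% 
  filter(str_detect(fulltext_org, "frciheit")) %>% 
  pull(fulltext_org)
```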

Okay, this context is a bit hard to examine - let’s see the word in its original context. We can do this through Mediestream, the media platform of the Danish newspaper collection. Above we have all the relevant information to make a very precise search: our word of interest, the date, and the newspaper ID. Thus our search string becomes:

date:1847-12-15 AND familyId:lyna AND frciheit

Link to the result:

http://www2.statsbiblioteket.dk/mediestream/avis/record/doms_aviser_page%3Auuid%3Ae70bff63-cfb6-489f-87e8-de78a49458a2/query/date%3A1847-12-15%20AND%20familyId%3Alyna%20AND%20frciheit

Wrap up

Congratulations! You have completed your very first text mining task and created an output! You have also successfully moved between distant reading and classical close reading.
You are now ready to venture further into the world of tidy text mining. This short introduction was based on the book Tidy Text Mining with R. Now that you know how to use an R Markdown document, you can use the book to explore its methods!